perm filename PAPER[X,ALS] blob
sn#086472 filedate 1974-02-12 generic text, type T, neo UTF8
The Amanuensis Speech Recognition System, under development
at the Stanford Artificial Intelligence Laboratory, attempts to
extract the maximum linguistic information from the acoustic signal
of continuous speech. While it is now generally recognized that a
complete speech understanding system must make use of many forms of
knowledge, syntactic, semantic, and contextual, in addition to the
information contained in the actual acoustic wave form, it is also
obvious that the over-all performance of a speech understanding
system depends critically upon the use that is made of the acoustic
input.

It is our belief that recent developments in computer
hardware, and in our ability to make effective use of this hardware,
have reopened the question of the extent to which the acoustic input
can be utilized, and that some expanded effort along these lines is
warranted. Accordingly, we have directed our efforts
to what might be called the front end. We are attempting to
demonstrate the extent to which acoustic information alone can be
used in solving the general speech recognition problem. At the same
time, the work of others will demonstrate the extent to which higher
sources of knowledge can compensate for deficiencies in the acoustic
wave and in our ability to abstract significant linguistic
information from it. Ultimately the two approaches can be combined to
achieve a level of performance that could not be achieved with either
approach alone.

The Amanuensis approach differs from the earlier acoustic
systems, and from the front-end approaches currently being used by the
rest of the ARPA community, in a number of important respects. In the
first place, we make no simplifying assumptions regarding the
uniqueness of a phonemic event. As everyone knows, the phonemes of
real speech are not isolated phonetic events but are manifested by
clues which overlap and extend for some distance from the central
region which might arbitrarily be assigned to a given phoneme.
Furthermore, any one phoneme can and does occur in a variety of
allophonic variations, which themselves are seldom pronounced the
same even by the same speaker in the same utterance. In
continuous speech, many of the clues which establish the identity of
a given allophone are modified by the environment of the allophone, and
some of them may be completely missing.

We attempt to deal with these complications by utilizing
redundant sets of clues, obtained both from the steady or nearly
steady portions of the wave form and from the transition regions. We
handle the mass of data which this approach generates by using
signature tables to correlate these data with significant features of
the utterance and ultimately with the phonemic intent. The required
multimodal relationships are established by means of training
sessions and are expressed by probability values stored in
the tables. Probability values are retained, not only for the most
probable choice for each phoneme, but also for alternate choices. The
higher level portions of a complete speech understanding system can
then select alternate choices on a probabilistic basis whenever the
first choices are found to be unreasonable in the light of syntactic,
semantic, or linguistic constraints.

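The memo gives no implementation details for the signature tables. As a purely illustrative sketch (the class name, the feature encoding, the phoneme labels, and the training data are all assumptions, not the system's actual design), a table trained by counting pairings of acoustic-clue signatures with phonemic intent, and retaining probability values for alternate choices, might look like this:

```python
from collections import defaultdict

# Hypothetical sketch of a signature table: each entry maps a quantized
# set of acoustic clues (a "signature") to probability values for the
# candidate phonemes, keeping alternates as well as the first choice.
class SignatureTable:
    def __init__(self):
        # counts[signature][phoneme] accumulated during training sessions
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, signature, phoneme):
        """Record one observed pairing of acoustic clues and phonemic intent."""
        self.counts[signature][phoneme] += 1

    def lookup(self, signature):
        """Return phoneme choices ranked by trained probability."""
        entry = self.counts[signature]
        total = sum(entry.values())
        if total == 0:
            return []
        return sorted(((p, n / total) for p, n in entry.items()),
                      key=lambda pair: pair[1], reverse=True)

# Illustrative training data (invented clue names and phoneme labels).
table = SignatureTable()
for sig, ph in [(("hi-F2", "voiced"), "IY"),
                (("hi-F2", "voiced"), "IY"),
                (("hi-F2", "voiced"), "IH")]:
    table.train(sig, ph)

# Most probable choice first; alternates retained with their probabilities,
# so a higher-level component can fall back to them when the first choice
# proves unreasonable.
choices = table.lookup(("hi-F2", "voiced"))
```

Retaining the full ranked list, rather than only the winner, is what lets the higher-level portions of the system reject an implausible first choice and try an alternate on a probabilistic basis.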
A second important difference between our work and that currently
being done elsewhere has to do with our use of machine learning techniques.
These techniques enable us both to establish the desired relationships between
acoustic clues and the phonemic intent of the speaker and to compensate for
differences between speakers. Some of the relationships between the
available acoustic input parameters and their phonemic interpretation
are not at all obvious, and all too often we have found that our a priori
evaluations were quite incorrect. Even our intuitive evaluations, based
on knowledge of earlier work, have too often proven to be grossly in error.

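The memo does not say how the learned tables compensate for differences between speakers. One conceivable scheme, offered only as an assumption for illustration (the interpolation weight, the data, and all names are invented), is to blend a general table's trained probabilities with counts gathered in a short per-speaker training session:

```python
# Purely hypothetical speaker compensation: interpolate between a general
# table's probability for a phoneme and an estimate from a small amount of
# per-speaker training data. Nothing here is taken from the actual system.
def adapted_probability(general, speaker_counts, signature, phoneme, weight=0.7):
    # probability from the speaker-independent table
    g = general.get(signature, {}).get(phoneme, 0.0)
    # probability estimated from this speaker's own training session
    entry = speaker_counts.get(signature, {})
    total = sum(entry.values())
    s = entry.get(phoneme, 0) / total if total else g
    # lean toward the speaker-specific estimate once data exists
    return weight * s + (1 - weight) * g

# Invented example values.
general = {("hi-F2", "voiced"): {"IY": 0.6, "IH": 0.4}}
speaker_counts = {("hi-F2", "voiced"): {"IY": 3, "IH": 1}}
p = adapted_probability(general, speaker_counts, ("hi-F2", "voiced"), "IY")
# 0.7 * 0.75 + 0.3 * 0.6 = 0.705
```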
While the Amanuensis approach is envisioned as a useful part
of a complete speech understanding system, it also provides a study
tool which can be used to arrive at a better evaluation of the
usefulness of different acoustic clues. In a final understanding
system these clues might be extracted by the same or alternative
computer methods, or they might better be extracted by specially
constructed hardware.